\[\begin{align} Pr(y=+1|x_1,x_2) &= \frac{p(x_1,x_2|y=+1) Pr(y=+1)}{p(x_1,x_2)} \end{align}\]
\[\begin{equation} Pr(y=+1|x_1,x_2) > Pr(y=-1|x_1,x_2) \end{equation}\]
\[\begin{equation} \hat{Pr}(y=+1|x_1,x_2) > \hat{Pr}(y=-1|x_1,x_2) \end{equation}\]
instead of directly modelling the data-generation process, model the classification boundary via a hyperplane:
\[\begin{equation} w.x - b = 0 \end{equation}\]
where \(w\in \mathbb{R}^2\), \(x\in\mathbb{R}^2\) and \(b\in\mathbb{R}\)
classification rule:
\[\begin{equation} f(x) = \text{sign}(w.x - b) \end{equation}\]
so \(w.x - b > 0 \implies f(x)=+1\); otherwise \(f(x)=-1\)
for linearly separable data \((x_i,y_i)\) for \(i=1,...,K\):
\[\begin{align} w.x_i - b &\geq +1, \; \text{ if } y_i = +1\\ w.x_i - b &\leq -1, \; \text{ if } y_i = -1\\ \end{align}\]
or, compactly:
\[\begin{equation} y_i (w.x_i - b) \geq 1 \end{equation}\]
to find the margin, take a point \(x\) on the decision boundary (\(w.x - b = 0\)) and move along \(w\) to the hyperplane \(w.x - b = 1\):
\[\begin{equation} x' = x + wt \end{equation}\]
implying:
\[\begin{equation} w.x' = w.(x + wt) = 1 + b \end{equation}\]
which, when rearranged, yields:
\[\begin{equation} |w|^2 t = 1 + b - w.x = 1 + b - b = 1 \end{equation}\]
So
\[\begin{equation} t = \frac{1}{|w|^2} \end{equation}\]
meaning that:
\[\begin{equation} x' = x + w \frac{1}{|w|^2} \end{equation}\]
and the distance is given by:
\[\begin{equation} |x'-x| = \frac{1}{|w|} \end{equation}\]
Optimal hyperplane equation is solution of:
\[\begin{equation} \text{min }|w|, \text{s.t. } y_i (w.x_i - b) \geq 1, \text{ for } i = 1,...,K \end{equation}\]
which is equivalent to, and more easily solved as, a quadratic programming problem:
\[\begin{equation} \text{min }\frac{1}{2}|w|^2, \text{s.t. } y_i (w.x_i - b) \geq 1, \text{ for } i = 1,...,K \end{equation}\]
for data that are not linearly separable, satisfying
\[\begin{equation} y_i (w.x_i - b) \geq 1 \end{equation}\]
is no longer possible for all \(i = 1,...,K\). Instead define the following hinge loss function:
\[\begin{equation} L = \frac{1}{K}\sum_{i=1}^{K}\text{max}(0, 1 - y_i (w.x_i - b)) \end{equation}\]
here \(L_i=0\) if a point is correctly classified (and outside the margin); otherwise \(L_i = 1 - y_i (w.x_i - b)\), which grows with distance from the decision boundary for misclassified points
putting the margin and loss terms together, minimise:
\[\begin{equation} H = C|w|^2 + \frac{1}{K}\sum_{i=1}^{K}\text{max}(0, 1 - y_i (w.x_i - b)) \end{equation}\]
where \(C\) controls the importance of having a large margin between (most of) the two classes; \(C\) is chosen as part of hyperparameter optimisation
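the soft-margin objective \(H\) can be minimised with (sub)gradient descent. A minimal sketch in numpy, where the synthetic Gaussian blobs and the choices of \(C\), learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: two Gaussian blobs, labels in {-1, +1}
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

C, eta, epochs = 0.01, 0.1, 200   # illustrative hyperparameter choices
w, b = np.zeros(2), 0.0

for _ in range(epochs):
    margins = y * (X @ w - b)
    active = margins < 1              # points with non-zero hinge loss
    # subgradient of H = C|w|^2 + (1/K) sum max(0, 1 - y_i (w.x_i - b))
    grad_w = 2 * C * w - (y[active, None] * X[active]).sum(axis=0) / len(y)
    grad_b = y[active].sum() / len(y)
    w -= eta * grad_w
    b -= eta * grad_b

print((np.sign(X @ w - b) == y).mean())   # training accuracy
```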
Suppose
\[\begin{equation} y_i|x_i \sim \mathcal{N}(\alpha + \beta x_i, \sigma), \end{equation}\]
where \((\alpha, \beta, \sigma)\) are parameters to be learned
assume \(\boldsymbol{y}=(y_1,y_2,...,y_K)\) and \(\boldsymbol{x}=(x_1,x_2,...,x_K)\)
\[\begin{equation} p(\alpha,\beta,\sigma|\boldsymbol{y},\boldsymbol{x}) \propto p(\boldsymbol{y}|\boldsymbol{x},\alpha, \beta,\sigma) p(\alpha,\beta,\sigma) \end{equation}\]
use these to determine posterior predictive distribution describing new \(\tilde{\boldsymbol{y}}\) for given (new) \(\tilde{\boldsymbol{x}}\):
\[\begin{equation} p(\tilde{\boldsymbol{y}}|\tilde{\boldsymbol{x}}) = \int p(\tilde{\boldsymbol{y}}|\tilde{\boldsymbol{x}},\alpha, \beta,\sigma) p(\alpha,\beta,\sigma|\boldsymbol{y},\boldsymbol{x}) d\alpha\;d\beta\;d\sigma \end{equation}\]
typically would be done via Markov chain Monte Carlo
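such integrals are rarely tractable, so one samples from the posterior instead. A minimal random-walk Metropolis sketch on synthetic data; the \(\mathcal{N}(0,1)\) priors, proposal scale, and iteration counts are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic data generated from y_i ~ N(alpha + beta * x_i, sigma)
alpha_t, beta_t, sigma_t = 1.0, 2.0, 0.5
x = rng.uniform(-1, 1, 50)
y = alpha_t + beta_t * x + rng.normal(0, sigma_t, 50)

def log_posterior(theta):
    a, b, log_s = theta            # work with log(sigma) so sigma stays positive
    s = np.exp(log_s)
    log_lik = -len(y) * np.log(s) - ((y - (a + b * x)) ** 2).sum() / (2 * s ** 2)
    log_prior = -(a ** 2 + b ** 2 + log_s ** 2) / 2   # weak N(0,1) priors (an assumption)
    return log_lik + log_prior

# random-walk Metropolis: accept a proposal with probability min(1, posterior ratio)
theta, samples = np.zeros(3), []
for _ in range(10000):
    proposal = theta + rng.normal(0, 0.1, 3)
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta)
post = np.array(samples[2000:])    # discard burn-in

print(post[:, 0].mean(), post[:, 1].mean())   # posterior means for alpha, beta
```

the retained samples approximate \(p(\alpha,\beta,\sigma|\boldsymbol{y},\boldsymbol{x})\); the posterior predictive is then approximated by simulating \(\tilde{y}\) from each sample.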
alternatively, take a frequentist approach and write the model as:
\[\begin{equation} y_i = \alpha + \beta x_i + \epsilon_i \end{equation}\]
where \(\epsilon_i\) is an error term. Define mean-squared loss:
\[\begin{equation} L = \frac{1}{K} \sum_{i=1}^{K} (y_i - (\alpha + \beta x_i))^2 \end{equation}\]
determine \(\hat{\alpha}\) and \(\hat{\beta}\) as those minimising \(L\):
\[\begin{align} \frac{\partial L}{\partial \alpha} &= -\frac{2}{K}\sum_{i=1}^{K} (y_i - (\alpha + \beta x_i)) = 0\\ \frac{\partial L}{\partial \beta} &= -\frac{2}{K}\sum_{i=1}^{K} x_i (y_i - (\alpha + \beta x_i)) = 0 \end{align}\]
although a closed form expression exists for \(\hat{\alpha}\) and \(\hat{\beta}\), for more general models, one doesn’t exist \(\implies\) use gradient descent optimisation
\[\begin{align} \alpha &= \alpha - \eta \frac{\partial L}{\partial \alpha}\\ \beta &= \beta - \eta \frac{\partial L}{\partial \beta} \end{align}\]
until \(\alpha\) and \(\beta\) no longer change. \(\eta\) is the learning rate
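the update rules above can be sketched directly; the synthetic data, learning rate, and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100)
y = 3.0 + 1.5 * x + rng.normal(0, 0.1, 100)   # true alpha = 3.0, beta = 1.5

alpha, beta, eta = 0.0, 0.0, 0.5   # eta: learning rate
for _ in range(2000):
    resid = y - (alpha + beta * x)
    grad_a = -2 * resid.mean()           # dL/d(alpha)
    grad_b = -2 * (x * resid).mean()     # dL/d(beta)
    alpha -= eta * grad_a
    beta -= eta * grad_b

print(alpha, beta)   # close to the true values 3.0, 1.5
```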
alternative model:
\[\begin{equation} y_i|x_i \sim \mathcal{N}(f_p(x_i), \sigma), \end{equation}\]
where
\[\begin{equation} f_p(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + ... + \theta_p x_i^p \end{equation}\]
model is better able to fit more complex datasets
to discourage overfitting, add a regularisation penalty to the loss:
\[\begin{equation} L = C||\theta||_q + \frac{1}{K} \sum_{i=1}^{K} (y_i - f_p(x_i))^2 \end{equation}\]
where \(||.||_q\) denotes the \(L_q\) norm: different choices (e.g. \(q=1\), lasso; \(q=2\), ridge) can yield very different estimates
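for \(q=2\) (ridge regression) the penalised loss has a closed-form minimiser, sketched below; the degree \(p\), penalty weight \(C\), and synthetic data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 30)
y = np.sin(np.pi * x) + rng.normal(0, 0.1, 30)

p, C = 9, 1e-3                               # polynomial degree, penalty weight
X = np.vander(x, p + 1, increasing=True)     # columns 1, x, x^2, ..., x^p

# minimising C|theta|^2 + (1/K) sum (y_i - f_p(x_i))^2 gives the normal equations
# (X^T X + C K I) theta = X^T y
theta = np.linalg.solve(X.T @ X + C * len(y) * np.eye(p + 1), X.T @ y)
print(np.mean((X @ theta - y) ** 2))         # small in-sample error
```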
for binary outcomes \(y_i\in\{0,1\}\), assume:
\[\begin{equation} y_i \sim \text{Bernoulli}(\theta_i) \end{equation}\]
where \(\theta_i = \text{Pr}(y_i=1)\) with \(0\leq \theta_i \leq 1\). The probability mass function is given by:
\[\begin{equation} \text{Pr}(y_i|\theta_i) = \theta_i^{y_i} (1 - \theta_i)^{1 - y_i} \end{equation}\]
so that \(\text{Pr}(y_i=1) = \theta_i\) and \(\text{Pr}(y_i=0) = 1 - \theta_i\)
In logistic regression, we use logistic function:
\[\begin{equation} \theta_i = f_\beta(x_i) := \frac{1}{1 + \exp (-(\beta_0 + \beta_1 x_i))} \end{equation}\]
assuming the data are i.i.d., the likelihood is:
\[\begin{equation} L=p(\boldsymbol{y}|\beta,\boldsymbol{x}) = \prod_{i=1}^{K} f_\beta(x_i)^{y_i} (1 - f_\beta(x_i))^{1 - y_i} \end{equation}\]
setting priors, we have a Bayesian model:
\[\begin{equation} p(\beta|\boldsymbol{x}, \boldsymbol{y}) \propto p(\beta) p(\boldsymbol{y}|\beta,\boldsymbol{x}) \end{equation}\]
so can use (say) Markov chain Monte Carlo to estimate model with uncertainty. Frequentists use gradient descent to find maximum likelihood estimates instead
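the frequentist route can be sketched as gradient descent on the (mean) negative log-likelihood; the synthetic data, true coefficients, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0, 1, 500)
p_true = 1 / (1 + np.exp(-(-0.5 + 2.0 * x)))   # beta_0 = -0.5, beta_1 = 2.0
y = rng.binomial(1, p_true)

X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
beta, eta = np.zeros(2), 0.5
for _ in range(5000):
    p = 1 / (1 + np.exp(-X @ beta))            # logistic function
    grad = X.T @ (p - y) / len(y)              # gradient of mean negative log-likelihood
    beta -= eta * grad

print(beta)   # maximum likelihood estimates, close to (-0.5, 2.0)
```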
straightforward to extend the model to incorporate multiple regressors:
\[\begin{equation} f_\beta(x_i) := \frac{1}{1 + \exp (-(\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i}))} \end{equation}\]
But how to interpret parameters of logistic regression?
another way of writing logistic function:
\[\begin{align} f_\beta(x_i) &= \frac{1}{1 + \exp (-(\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i}))}\\ &= \frac{\exp (\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i})}{1 + \exp (\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i})} \end{align}\]
so that
\[\begin{align} 1 - f_\beta(x_i) = \frac{1}{1 + \exp (\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i})} \end{align}\]
taking the ratio:
\[\begin{equation} \text{odds} = \frac{f_\beta(x_i)}{1-f_\beta(x_i)} = \exp (\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i}) \end{equation}\]
so that
\[\begin{equation} \log\text{odds} =\beta_0 + \beta_1 x_{1,i} + ... + \beta_p x_{p,i} \end{equation}\]
meaning (say) \(\beta_1\) represents the change in log-odds for a one unit change in \(x_{1}\)
for new data point \(\tilde x_i\):
many options are possible. Common metrics include cosine similarity:
\[\begin{equation} s(x_1,x_2) = \frac{x_1.x_2}{|x_1||x_2|} \end{equation}\]
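a quick sketch of cosine similarity:

```python
import numpy as np

def cosine_similarity(x1, x2):
    # s(x1, x2) = (x1 . x2) / (|x1| |x2|)
    return x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0 (parallel)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0 (orthogonal)
```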
assume
\[\begin{equation} \boldsymbol{x} \sim \mathcal{N}(0, I) \end{equation}\]
where \(I\in\mathbb{R}^{d\times d}\) is the identity matrix. What does the distribution of Euclidean distances between points look like as \(d\) changes?
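a quick simulation (200 points per dimension; the sample sizes are illustrative) shows distances concentrating around \(\sqrt{2d}\) as \(d\) grows, so nearest and farthest neighbours become nearly indistinguishable:

```python
import numpy as np

rng = np.random.default_rng(5)

for d in (2, 10, 100, 1000):
    X = rng.standard_normal((200, d))
    # squared pairwise distances via |x_i|^2 + |x_j|^2 - 2 x_i . x_j
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    dists = np.sqrt(np.maximum(d2, 0))[np.triu_indices(200, k=1)]
    # mean distance grows like sqrt(2d); relative spread (std/mean) shrinks
    print(d, round(dists.mean(), 2), round(dists.std() / dists.mean(), 3))
```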
| patient | breathing issue (B) | high temp (T) | loss taste (L) | covid |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 0 |
| 2 | 1 | 0 | 1 | 1 |
| 3 | 0 | 0 | 1 | 1 |
| 4 | 0 | 1 | 1 | 0 |
| 5 | 0 | 1 | 0 | 1 |
\[\begin{equation} f(\boldsymbol{x}):=\text{Pr}(C=1|\boldsymbol{x}) := \frac{1}{|\mathcal{S}(\boldsymbol{x})|} \sum_{i\in \mathcal{S}(\boldsymbol{x})} C_i \end{equation}\]
where \(\mathcal{S}(\boldsymbol{x})\) is the set of individuals with symptoms \(\boldsymbol{x}\)
\[\begin{align} f(\emptyset) &= \frac{3}{5}\\ f(B=1) &= \frac{1}{2}\\ f(B=1,T=1) &= 0\\ \end{align}\]
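these conditional fractions can be read straight off the table; a small sketch:

```python
# the patient table, one row per patient: (B, T, L, covid)
data = [
    (1, 1, 1, 0),
    (1, 0, 1, 1),
    (0, 0, 1, 1),
    (0, 1, 1, 0),
    (0, 1, 0, 1),
]

def f(**symptoms):
    """Empirical Pr(covid=1 | symptoms): average label over matching patients."""
    cols = {"B": 0, "T": 1, "L": 2}
    matches = [row[3] for row in data
               if all(row[cols[k]] == v for k, v in symptoms.items())]
    return sum(matches) / len(matches)

print(f())            # 3/5
print(f(B=1))         # 1/2
print(f(B=1, T=1))    # 0.0
```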
decision trees choose splits using entropy; for binary labels with \(p = \text{Pr}(y=1)\):
\[\begin{equation} H = -\sum_{i=1}^{2} p_i \log p_i = -(p \log p + (1 - p) \log (1-p)) \end{equation}\]
define \(H(D|a)\) as the conditional entropy after splitting on a given variable \(a\) for training data \(D\):
\[\begin{equation} H(D|a) = \sum_{v\in\text{vals}(a)} \frac{|\mathcal{S}(v)|}{|\mathcal{S}(\emptyset)|} H(\mathcal{S}(v)) \end{equation}\]
when we start, we have entropy:
\[\begin{equation} H(\emptyset) = -3/5\log (3/5) - 2/5 \log (2/5) \approx 0.97 \end{equation}\]
\[\begin{align} H(D|B) &= 3/5 H(B=0) + 2/5 H(B=1)\\ &= 3/5 (-2/3\log(2/3) - 1/3\log(1/3))\\ \;\;& + 2/5 (-1/2\log(1/2) - 1/2\log(1/2)) \approx 0.95 \end{align}\]
\[\begin{align} H(D|T) &= 2/5 H(T=0) + 3/5 H(T=1)\\ &= 2/5 (0)\\ \;\;& + 3/5 (-2/3\log(2/3) - 1/3\log(1/3)) \approx 0.55 \end{align}\]
\[\begin{align} H(D|L) &= 1/5 H(L=0) + 4/5 H(L=1)\\ &= 1/5 (0)\\ \;\;& + 4/5 (-1/2\log(1/2) - 1/2\log(1/2)) = 0.8 \end{align}\]
initial entropy \(\approx\) 0.97. After splitting: \(H(D|B)\approx 0.95\), \(H(D|T)\approx 0.55\), \(H(D|L) = 0.8\)
so splitting on \(T\) yields the largest reduction in entropy (information gain) and is optimal
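the conditional entropies above can be checked programmatically (using \(\log_2\)):

```python
import math

# the patient table, one row per patient: (B, T, L, covid)
data = [
    (1, 1, 1, 0),
    (1, 0, 1, 1),
    (0, 0, 1, 1),
    (0, 1, 1, 0),
    (0, 1, 0, 1),
]

def entropy(labels):
    """Binary entropy of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 0.0 if p in (0, 1) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def cond_entropy(col):
    """H(D|a): entropy of the covid label after splitting on column `col`."""
    h = 0.0
    for v in (0, 1):
        subset = [row[3] for row in data if row[col] == v]
        h += len(subset) / len(data) * entropy(subset)
    return h

print(round(cond_entropy(0), 2))   # H(D|B) ~ 0.95
print(round(cond_entropy(1), 2))   # H(D|T) ~ 0.55
print(round(cond_entropy(2), 2))   # H(D|L) = 0.8
```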
tree still defined as:
\[\begin{equation} f(x) := \frac{1}{|\mathcal{S}(\boldsymbol{x})|} \sum_{i\in \mathcal{S}(\boldsymbol{x})} y_i \end{equation}\]
but \(y_i\in\mathbb{R}\), and splits are chosen based on the reduction in the standard deviation of \(y\), as opposed to entropy
to reduce variance, aggregate \(B\) trees by majority vote:
\[\begin{equation} f(\boldsymbol{x}) = \text{mode}\{f_1(\boldsymbol{x}),...,f_B(\boldsymbol{x})\} \end{equation}\]
where \(f_b(\boldsymbol{x})\) is a decision tree trained on a random sample (with replacement) drawn from original training set; there are \(B\) such samples. Process known as “bagging”
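a minimal bagging sketch, using decision stumps (depth-1 trees, a simplification) as the base learner; the 1-D dataset, label-noise level, and \(B\) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

# toy 1-D data: class 1 when x > 0, with 10% label noise
x = rng.uniform(-1, 1, 200)
y = ((x > 0) ^ (rng.uniform(size=200) < 0.1)).astype(int)

def fit_stump(x, y):
    """Depth-1 tree: pick the threshold/orientation with fewest training errors."""
    best = None
    for t in np.unique(x):
        for left, right in ((0, 1), (1, 0)):
            err = (np.where(x <= t, left, right) != y).sum()
            if best is None or err < best[0]:
                best = (err, t, left, right)
    return best[1:]

def predict_stump(stump, x):
    t, left, right = stump
    return np.where(x <= t, left, right)

# bagging: B stumps, each trained on a bootstrap resample of the training set
B = 25
stumps = []
for _ in range(B):
    idx = rng.integers(0, len(x), len(x))   # sample with replacement
    stumps.append(fit_stump(x[idx], y[idx]))

votes = np.stack([predict_stump(s, x) for s in stumps])
f = (votes.mean(axis=0) > 0.5).astype(int)  # majority vote = mode for 0/1 labels
print((f == y).mean())                      # training accuracy
```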
estimating mean height of individuals in population
assume start with naive model:
\[\begin{equation} f_0 = \frac{1}{K}\sum_{i=1}^{K} y_i \end{equation}\]
calculate residuals:
\[\begin{equation} \hat{y_i} = y_i - f_0 \end{equation}\]
train a new model on the residuals \(\hat{y_i}\), yielding \(f_1\)
new predictor becomes:
\[\begin{equation} f(x_i) := f_0 + \alpha f_1(x_i) \end{equation}\]
where \(\alpha\) is learning rate (a hyperparameter). Now calculate new residuals:
\[\begin{equation} \hat{y_i} = y_i - f(x_i) \end{equation}\]
and fit a second decision tree to them; and so on. This iterative process of incrementally improving models is known as boosting
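the whole loop can be sketched with regression stumps fit to successive residuals; the synthetic data, learning rate \(\alpha\), and number of rounds are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

def fit_reg_stump(x, r):
    """Depth-1 regression tree: split minimising squared error on residuals r."""
    best = None
    for t in np.unique(x)[::4]:             # subsample thresholds for speed
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def predict(stump, x):
    t, left_val, right_val = stump
    return np.where(x <= t, left_val, right_val)

alpha = 0.3                  # learning rate (a hyperparameter)
f0 = y.mean()                # naive initial model
pred = np.full_like(y, f0)
for _ in range(100):         # number of boosting rounds
    resid = y - pred                      # current residuals
    stump = fit_reg_stump(x, resid)       # fit next tree to the residuals
    pred += alpha * predict(stump, x)     # update the additive model

print(np.mean((y - pred) ** 2))   # training MSE shrinks toward the noise level
```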
boosting parameters include the number of boosting rounds and the learning rate \(\alpha\)
tree parameters include the maximum depth and the minimum number of samples per leaf